Python Machine Learning

Chapter 1 - Giving Computers the Ability to Learn from Data

Overview

In this lecture, we cover the following topics:

  • The general concepts of machine learning
  • The three types of learning and basic terminology
  • The building blocks for successfully designing machine learning systems
  • Installing Python packages
In [2]:
from IPython.display import Image

Building intelligent machines to transform data into knowledge

  • Machine learning: a subfield of AI

  • ML refers to algorithms that can derive rules for predictions automatically from data

  • ML is now a central field of computer science and plays an increasingly important role in everyday life.

The three different types of machine learning

In [3]:
Image(filename='./images/01_01.png', width=500)
Out[3]:



Making predictions about the future with supervised learning

In [3]:
Image(filename='./images/01_02.png', width=500)
Out[3]:
  • Supervised Learning: learn a model from labeled training data that allows us to make predictions about unseen (future) data points

  • The term supervised refers to the fact that the desired output labels of training samples are already known

  • Ex. spam email filtering

Two types:

  • Classification: a supervised learning task with discrete class labels
  • Regression: a supervised learning task where the outcome signal is a continuous value



Classification for predicting class labels

In [4]:
Image(filename='./images/01_03.png', width=300)
Out[4]:

Terminology

  • decision boundary
  • positive class
  • negative class

Goal: to predict the categorical class labels of new instances based on past observations

  • Class labels are discrete, unordered values: they can be understood as the group memberships of instances

Types of classification

  • Binary classification: distinguish between two possible classes (e.g. spam and non-spam emails)

  • Multi-class classification: distinguish amongst multiple classes (e.g. handwritten digits from 0 to 9)
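Both kinds of classification use the same fit/predict workflow in scikit-learn (installed later in this chapter). Below is a minimal binary-classification sketch for the spam example; the feature values are hypothetical (e.g. counts of two suspicious words per email):

```python
from sklearn.linear_model import LogisticRegression

# Toy labeled training data (made up for illustration).
X_train = [[5, 3], [4, 4], [6, 5],    # spam emails (class label 1)
           [0, 1], [1, 0], [0, 0]]    # non-spam emails (class label 0)
y_train = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression()
clf.fit(X_train, y_train)             # learn a decision boundary from labeled data
pred = clf.predict([[5, 4], [0, 1]])  # predict class labels for new emails
print(pred)                           # → [1 0]
```

A multi-class problem (e.g. digits 0 to 9) uses the same `fit`/`predict` interface, only with more than two distinct labels in `y_train`.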



Regression for predicting continuous outcomes

In [5]:
Image(filename='./images/01_04.png', width=300)
Out[5]:
  • linear regression: slope and intercept are learnt from data

Goal: to predict continuous outcome

Given:

  • A number of predictor (explanatory) variables (e.g. time spent for studying)
  • A continuous response (outcome) variable (e.g. math SAT score)

We try to find a relationship between those variables that allows us to predict the outcome
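A minimal sketch of this setup with scikit-learn (the study-time/score numbers are invented and exactly linear, so the fitted slope and intercept are easy to verify):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: hours spent studying vs. math SAT score.
hours = np.array([[2], [4], [6], [8]])       # predictor (explanatory) variable
scores = np.array([500, 560, 620, 680])      # continuous response (outcome) variable

reg = LinearRegression().fit(hours, scores)  # slope and intercept learnt from data
print(reg.coef_[0], reg.intercept_)          # slope ≈ 30, intercept ≈ 440
print(reg.predict([[5]]))                    # predicted score for 5 hours ≈ 590
```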



Solving interactive problems with reinforcement learning

In [6]:
Image(filename='./images/01_05.png', width=300)
Out[6]:

Goal: to develop a system (agent) that improves its performance based on interactions with the environment.

The environment provides feedback that includes a reward signal; this is not a ground-truth label or value, but a measure of how good the action was, as defined by a reward function

Ex. a chess engine. The agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as win or lose at the end of the game.
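The agent/environment reward loop can be sketched without any chess machinery. Below is a hypothetical two-armed bandit: the reward signal is a win (1) or loss (0) drawn from hidden probabilities, and the agent improves its estimates from interaction alone:

```python
import random
random.seed(0)

# Minimal reinforcement-learning sketch: epsilon-greedy action selection.
true_win_prob = {0: 0.2, 1: 0.8}   # hidden environment: win probability per action
estimates = {0: 0.0, 1: 0.0}       # the agent's learned value estimates
counts = {0: 0, 1: 0}

for step in range(1000):
    # Mostly exploit the best-looking action, sometimes explore at random.
    if random.random() < 0.1:
        action = random.choice([0, 1])
    else:
        action = max(estimates, key=estimates.get)
    reward = 1 if random.random() < true_win_prob[action] else 0  # reward signal
    counts[action] += 1
    # Incremental average of the rewards observed for this action.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimate for action 1 ends up higher
```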

Discovering hidden structures with unsupervised learning

  • Supervised learning: we know the right label beforehand when we train the model

  • Unsupervised learning: we deal with unlabeled data. We explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function.

Finding subgroups with clustering

In [7]:
Image(filename='./images/01_06.png', width=300)
Out[7]:

Clustering: an exploratory data analysis technique that allows us to organize a pile of information into meaningful subgroups (clusters) without having any prior knowledge of their group memberships.

Ex. discover customer groups based on their interests, in order to develop distinct marketing programs.
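A minimal sketch with k-means from scikit-learn; the 2-D "customer" points below are invented so that two subgroups are obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: no group memberships are given to the algorithm.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)   # discover the subgroups from structure alone
print(labels)                # first three points share one cluster id, last three the other
```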



Dimensionality reduction for data compression

In [8]:
Image(filename='./images/01_07.png', width=500)
Out[8]:
  • We often work with data of high dimensionality: each observation comes with a high number of measurements
  • This poses a challenge for limited storage space and for the computational performance of machine learning algorithms
  • Unsupervised dimensionality reduction is a common approach in feature preprocessing: it removes noise from data (which can degrade the predictive performance of certain algorithms) and compresses the data onto a lower-dimensional subspace while retaining most of the relevant information
  • A useful visualization tool: high dim --> 2D, 3D
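A minimal PCA sketch on synthetic data: 100 observations in 5 dimensions whose signal lives in a 2-dimensional subspace, plus a little measurement noise:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 2))          # the true 2-D structure
X = signal @ rng.normal(size=(2, 5))        # embedded in 5 dimensions
X += 0.01 * rng.normal(size=(100, 5))       # small measurement noise

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                 # compress 5-D -> 2-D
print(X_2d.shape)                           # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1: little information lost
```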

An introduction to the basic terminology and notations

In [4]:
Image(filename='./images/01_08.png', width=500)
Out[4]:

The Iris dataset: 150 iris flowers from three different species (Setosa, Versicolor, and Virginica)

A feature (data) matrix X: samples as rows, features as columns. Each row of X is the feature vector of a single sample.
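The Iris dataset ships with scikit-learn, so the notation can be checked directly:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # feature matrix X and class-label vector y
print(X.shape)                  # (150, 4): 150 samples as rows, 4 features as columns
print(iris.feature_names)       # sepal/petal lengths and widths
print(iris.target_names)        # the three species
```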



A roadmap for building machine learning systems

In [5]:
Image(filename='./images/01_09.png', width=700)
Out[5]:



Preprocessing - getting data into shape

  • One of the most crucial steps
  • In Iris dataset, useful features could be the color, the hue, the intensity of the flowers, the height, and the flower lengths and widths
  • Many ML algorithms require that the selected features are on the same scale, e.g. in the range of [0,1]
  • Some features may be highly correlated and therefore redundant --> dimensionality reduction can be useful for compressing the features onto a lower dimensional subspace
  • Split a dataset into two parts:
    • Training set: to train and optimize our machine learning algorithm
    • Test set: keep it until the very end to evaluate the final model
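A minimal sketch of two of these preprocessing steps on the Iris data, using `train_test_split` and `MinMaxScaler` from scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set; do not touch it until the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Bring every feature onto the same [0, 1] scale.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
print(X_train.shape, X_test.shape)                 # (105, 4) (45, 4)
print(X_train_scaled.min(), X_train_scaled.max())  # 0.0 1.0
```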

Training and selecting a predictive model

No free lunch theorem (David Wolpert, 1996)

  • An ML algorithm is designed to perform well on certain tasks, which requires certain assumptions

  • There is no universal ML algorithm that performs well on all tasks.

  • No single classification model enjoys superiority if we don't make any assumptions about the task.

Therefore, it is essential to compare a handful of different algorithms in order to train and select the best-performing model.

  • We need a metric to measure and compare performance of different models, e.g. prediction accuracy

Validation set: a subset of the training data, used for e.g. model selection (recall that we do not touch the test data till the end, for performance evaluation regarding future data points)

  • Used for tuning hyperparameters
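A minimal model-comparison sketch with 5-fold cross-validation, where each fold plays the role of a validation set; the two candidate models and the Iris data are just illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a handful of candidate models with the same metric (accuracy).
candidates = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
}
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy per fold
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```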

Evaluating models and predicting unseen data instances

After we've selected a model that has been fitted on the training data, we can use the test dataset to estimate how well it performs on unseen data, i.e. to estimate the generalization error.

If we're satisfied with the performance, we can use this model to predict new, future data

Important note: feature scaling and dimensionality reduction must be obtained solely from the training dataset, and the same parameters are later applied to transform the test dataset
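A sketch of this important note with `StandardScaler`; the toy numbers are chosen so the learned parameters are easy to verify by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [10.0]])

scaler = StandardScaler()
scaler.fit(X_train)                    # mean and std estimated from training data ONLY
X_test_std = scaler.transform(X_test)  # the SAME parameters transform the test data

print(scaler.mean_)        # [2.5] -- the training mean, not the test mean
print(X_test_std.ravel())  # test points expressed in training-set units
```

Calling `fit` (or `fit_transform`) on the test data instead would leak information about the test distribution into the preprocessing step.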

Using Python for machine learning

One of the most popular languages for data science

Python itself is not very fast, since it is interpreter-based

The NumPy and SciPy libraries are built on lower-layer Fortran and C implementations

Installing Python Packages

  • Anaconda: conda install
  • Python: pip install
NumPy >= 1.12.1
SciPy >= 0.19.0
scikit-learn >= 0.18.1
matplotlib >= 2.0.2
pandas >= 0.20.1
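After installation, the versions above can be checked from a running interpreter:

```python
# Print the installed version of each required package.
import matplotlib
import numpy
import pandas
import scipy
import sklearn

for pkg in (numpy, scipy, sklearn, matplotlib, pandas):
    print(pkg.__name__, pkg.__version__)
```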

Jupyter Notebook

  • conda install jupyter
  • pip install jupyter
$ jupyter notebook